Importing packages

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

Preparing the data

  • Loading the file
  • Normalising the data
  • Splitting into train and test datasets
In [2]:
df = pd.read_csv("kc_house_data.csv").set_index('id')
df['date'] = df['date'].apply(lambda x: int(x.split('T')[0]))  # e.g. "20141013T000000" -> 20141013
df = df.astype('float')  # astype returns a copy, so it must be assigned back

# Z-score normalisation: zero mean, unit standard deviation per column
norm_df = (df - df.mean()) / df.std()
label = norm_df.pop('price')

train, test, labels_train, labels_test = train_test_split(norm_df, label, train_size=0.80)

train.columns
Out[2]:
Index(['date', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')
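The normalisation and split above can be sketched on a tiny synthetic frame (the numbers below are made up, and only two columns are kept for brevity):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the housing data (hypothetical values).
toy = pd.DataFrame({
    "sqft_living": [1180.0, 2570.0, 770.0, 1960.0, 1680.0, 5420.0],
    "price":       [221900.0, 538000.0, 180000.0, 604000.0, 510000.0, 1225000.0],
})

# Z-score normalisation, as in the notebook: subtract the mean, divide by std.
norm = (toy - toy.mean()) / toy.std()

label = norm.pop("price")  # target column, removed from the features
X_tr, X_te, y_tr, y_te = train_test_split(norm, label, train_size=0.8, random_state=0)

print(norm["sqft_living"].mean(), norm["sqft_living"].std())  # ~0 and ~1
print(len(X_tr), len(X_te))
```

After z-scoring, every column has (up to floating-point error) mean 0 and standard deviation 1, which puts the features on a common scale before fitting.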

Creating random forest and linear regression models

  • Fitting the model
  • Predicting test data
In [3]:
def evaluate(model):
    """Fit a regressor on the training set and report its test-set errors."""
    model.fit(train, labels_train)
    pred = model.predict(test)
    print(f"MSE: {mean_squared_error(labels_test, pred)}, MAE: {mean_absolute_error(labels_test, pred)}")
    return model, pred
In [4]:
model_rf, pred_rf = evaluate(RandomForestRegressor(n_estimators=100, random_state=0))
model_lin, pred_lin = evaluate(LinearRegression())
MSE: 0.12002199092384985, MAE: 0.18138663507142141
MSE: 0.26549235700713747, MAE: 0.33627265919543226
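A side note on the metric calls: MSE and MAE depend only on the pairwise differences between the two arrays, so their value is unaffected by the argument order, while a metric such as R² is order-sensitive and needs the documented (y_true, y_pred) order. A minimal check on toy arrays (not the housing data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 1.5, 2.0])

# MSE and MAE only look at the differences, so swapping arguments changes nothing:
assert mean_squared_error(y_true, y_pred) == mean_squared_error(y_pred, y_true)
assert mean_absolute_error(y_true, y_pred) == mean_absolute_error(y_pred, y_true)

# r2_score is NOT symmetric, so there the (y_true, y_pred) order matters:
print(r2_score(y_true, y_pred), r2_score(y_pred, y_true))
```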

LIME decomposition of one observation

  • Taking three observations from the train dataframe
  • Generating their LIME decompositions for the random forest model
In [5]:
import lime
import lime.lime_tabular

explainer = lime.lime_tabular.LimeTabularExplainer(train.values,  # LIME expects a numpy array
                                                   mode='regression',
                                                   feature_names=list(train.columns),
                                                   discretize_continuous=False
                                                  )

def exp_rf_inst(i):
    """Show the LIME explanation of the random forest's prediction for row i."""
    exp = explainer.explain_instance(train.iloc[i].values, model_rf.predict)
    exp.show_in_notebook(show_table=True, show_all=False)
In [6]:
exp_rf_inst(10)
In [7]:
exp_rf_inst(100)
In [8]:
exp_rf_inst(1000)

Comparing the three observations, the explanations are stable: in each of them sqft_living is the most important or second most important feature; the top three features are the same (up to permutation) across all three; the fourth feature is long in all of them, with a negative contribution in each; and the attributed weights are similar in magnitude.
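This stability claim can also be checked programmatically: lime's Explanation object exposes its (feature, weight) pairs via exp.as_list(), so the top-k feature sets of several explanations can be intersected. A minimal sketch with hypothetical, hard-coded weights (real code would build these dictionaries from exp.as_list(); the numbers below are made up for illustration):

```python
# Hypothetical LIME weights for three observations; feature names are from the
# notebook, the values are invented.
exp_a = {"sqft_living": 0.41, "grade": 0.30, "lat": 0.22, "long": -0.10}
exp_b = {"grade": 0.38, "sqft_living": 0.35, "lat": 0.20, "long": -0.08}
exp_c = {"lat": 0.33, "sqft_living": 0.31, "grade": 0.27, "long": -0.12}

def top_features(weights, k):
    """Names of the k features with the largest absolute attribution."""
    return set(sorted(weights, key=lambda f: abs(weights[f]), reverse=True)[:k])

# Stability: the same top-3 set (possibly permuted) for every observation.
common = top_features(exp_a, 3) & top_features(exp_b, 3) & top_features(exp_c, 3)
print(common)
```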

LIME comparison between linear regression and random forest models

In [9]:
def exp_lin_inst(i):
    """Show the LIME explanation of the linear model's prediction for row i."""
    exp = explainer.explain_instance(train.iloc[i].values, model_lin.predict)
    exp.show_in_notebook(show_table=True, show_all=False)
In [10]:
exp_lin_inst(10)
In [11]:
exp_rf_inst(10)

The LIME decompositions for the 10th observation differ considerably between the two models. The biggest difference is in sqft_above, whose contribution is negligibly negative for the random forest but large for the linear model.
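A complementary, global way to compare the two models is to rank features by the forest's impurity-based feature_importances_ against the absolute coefficients of the linear model. A sketch on synthetic data (make_regression, not the housing data; comparing |coef_| across features is only meaningful here because the generated features share a scale):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic regression problem with a few informative features.
X, y = make_regression(n_samples=300, n_features=5, n_informative=3,
                       noise=1.0, random_state=0)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
lin = LinearRegression().fit(X, y)

# Global importance per model: impurity-based for the forest,
# absolute coefficient for the linear model.
rf_rank = np.argsort(rf.feature_importances_)[::-1]
lin_rank = np.argsort(np.abs(lin.coef_))[::-1]
print("RF ranking: ", rf_rank)
print("Lin ranking:", lin_rank)
```

Unlike LIME, which is local to one observation, these rankings summarise each model over the whole dataset, so they are a useful cross-check on per-instance explanations.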
